To pretrain or not? A systematic analysis of the benefits of pretraining in diabetic retinopathy

There is an increasing number of medical use cases where classification algorithms based on deep neural networks reach performance levels that are competitive with human medical experts. To alleviate the challenges of small dataset sizes, these systems often rely on pretraining. In this work, we aim to assess the broader implications of these approaches in order to better understand what type of pretraining works reliably in practice (with respect to performance, robustness, learned representations, etc.) and what type of pretraining dataset is best suited to achieve good performance in small target dataset size scenarios. Considering diabetic retinopathy grading as an exemplary use case, we compare the impact of different training procedures, including recently established self-supervised pretraining methods based on contrastive learning. To this end, we investigate different aspects such as quantitative performance, statistics of the learned feature representations, interpretability and robustness to image distortions. Our results indicate that models initialized from ImageNet pretraining show a significant increase in performance, generalization and robustness to image distortions. In particular, self-supervised models show further benefits over supervised models. Self-supervised models with initialization from ImageNet pretraining not only achieve higher performance, they also reduce overfitting to large lesions and improve at taking into account minute lesions indicative of the progression of the disease. Understanding the effects of pretraining in a broader sense that goes beyond simple performance comparisons is of crucial importance for the broader medical imaging community beyond the use case considered in this work.


Introduction
The role of computer vision algorithms based on deep learning in medical imaging in the form of decision support systems has increased steadily in the past few years [1][2][3][4][5][6][7]. There is an enormous amount of data that is being produced on a daily basis from different areas using different imaging modalities such as MRI, CT, microscopy, etc., leading to an unprecedented potential for machine learning algorithms. However, while there exists a lot of data, it is usually not prepared to be used for research in machine learning. In particular, it is often unlabeled as the labeling process is expensive and time-consuming or sometimes medical experts may not agree on the appropriate label.
A practitioner using deep neural networks (DNNs) for medical imaging is faced with a plethora of options when it comes to the training methodology. Several factors can influence the decision-making process, including but not limited to the size, noise level and quality of the dataset at hand, the computational resources available and the robustness of the trained DNN. Transfer learning, i.e. pretraining models on a large corpus of natural images, has been found to improve performance and speed up convergence on downstream tasks such as medical imaging [1,8]. A straightforward way of utilizing transfer learning is to finetune a model that has initially been trained on ImageNet [9] on the medical dataset.
Other common state-of-the-art methods in machine learning are supervised learning methods, i.e. models that are trained with labeled data, as opposed to methods that require only some or even no labeled data, such as semi-supervised or self-supervised learning. Fortunately, the field of self-supervised learning has recently advanced significantly [12][13][14][15], which gives rise to hope for a successful deployment of machine learning in medical applications without relying on overly large amounts of labeled data. A first result in this regard was obtained in [6,16,17], where the authors showed that pretraining using self-supervision helps to improve the models for chest x-ray classification [18], dermatology condition classification [19] and COVID-19 deterioration prediction [17].
With the widespread adoption of transfer learning in medical imaging, it becomes essential to explore the differentiating features of the various training methodologies, supervised or self-supervised. [1] observe the effects of pretraining on the speed of convergence and the feature representations learned, but only in a supervised learning setting. [8] find that pretrained models from ImageNet provide improvements in the quality of the features learned as well as improvements in performance on diverse downstream datasets. Despite the benefits of transfer learning, it has remained unclear what transfer learning, especially with self-supervised learning, actually exploits when making a prediction. For this (as we will see), simply looking at performance metrics like classification accuracy or area under the receiver operating characteristic curve (AUC) is not sufficient. The potential advantages of using self-supervised methods over supervised methods for medical imaging beyond such performance metrics thus remain a challenging object of study.
In this contribution, we demonstrate for diabetic retinopathy (DR) as a particular medical imaging use case, that going beyond metrics of predictive performance is mandatory. We further analyze robustness to statistical variations of the data. Furthermore we validate previous results on smaller data sets which are of ubiquitous interest to practitioners in medical data science.
To this end, we perform a detailed study of what is being learned by the different training methodologies available to train a DNN for medical imaging. Broadly, the training methodologies will be categorized into two types:
• Fully supervised (FS)
• Self-supervised with contrastive learning (CL)
along with two types of initialization of the weights before training on the medical dataset:

PLOS ONE
• Initialization with no external data (IWNE)
• Initialization from ImageNet (IFI)
The focus of this paper is to study the effects of training the DNN using these strategies and evaluate the benefits. Fig 1 gives an overview of our contributions, which are as follows:
1. We evaluate the performance of the four different training strategies, i.e. supervised and self-supervised models trained with or without external data for pretraining, in detecting diabetic retinopathy in retinal images. We find that IFI helps in achieving a significant gain in performance, especially when a limited amount of the downstream (medical) labeled dataset is used. IFI-CL provides a further increase in performance.
2. Given that IFI is beneficial in terms of performance, we investigate what makes IFI models better by analyzing the eigenvalue spread of the activations of the hidden layers. We find that the redefined condition number for the IFI models is lower than that of IWNE models for the initial layers, which are important for learning diverse and effective feature representations from the input. IFI makes the eigenvalue spread of the activations of the first hidden layer broader, implying that a wider range of kernels fire for a given input. In both IWNE and IFI models, we show that CL achieves a broader eigenvalue spread compared to its supervised counterparts.
3. Using explainability of DNNs, we investigate what the different models look at in the input for making a decision. With the help of ground-truth segmentation maps available for diabetic retinopathy on the IDRiD challenge [11], we study in a quantitative manner what information was used by the models to make the prediction. We find that IWNE-FS overfits to large lesions like hard exudates and ignores smaller lesions to predict the disease. IFI models show a significantly reduced tendency to overfit to one particular type of lesion. IFI-CL in particular is able to consider a wider range of lesions to make an accurate prediction for the disease.

Fig 1. Overview of the experiments presented in this work. a) shows the different pretraining strategies: Initialization from ImageNet (IFI) [9] and Initialization without any external data (IWNE), i.e. pretraining only on the Eyepacs-1 dataset [10]. Such a pretraining step can be performed either in a supervised or a self-supervised manner. This is followed by finetuning on the Eyepacs-1 dataset. b) investigates the statistics of the eigenvalues of the feature representations learned by the different methods, which lead to increased robustness to distortions. c) shows the experiments we perform using the Indian Diabetic Retinopathy Image Dataset (IDRiD) challenge data [11] to quantitatively evaluate the cues learned. https://doi.org/10.1371/journal.pone.0274291.g001


Diabetic retinopathy
DNNs have seen wide adoption for the task of DR assessment in [2,3], among others. While some methods train their model from scratch [20,21,32,35,43], IFI models have predominantly achieved higher performance [2,3,23,26,30,40]. Some methods also perform their training on large private data [2,20,24,29,33]. A reproduction study of [2] was performed by [3], showing difficulty in achieving similar performance for DR when trained on publicly available datasets. Systematic studies of using uncertainty measures for DR were also conducted by [43,44]. While [22] studied the probability maps with ground-truth segmentation maps to ascertain what the DNN prediction was looking for, [45] studied a computer-assisted setting with explanation methods for deep learning models in grading for DR. There is, however, no dedicated study on the implications of different training methodologies.

Supervised vs. self-supervised learning
Self-supervised learning has been utilized in a wide range of biomedical applications including chest x-rays [4-6,17], diabetic retinopathy [47,48], COVID-19 detection [17], etc. In spite of the improvements shown by self-supervised learning, [49] find that self-supervised models behave quite similarly to their supervised counterparts in many aspects of robustness. Self-supervised models report a slightly higher performance gain over their supervised counterparts on medical imaging [4,6]. Recent works show the generalizing capabilities of self-supervised learning on chest x-rays [50]. The improvements and benefits still need to be rigorously investigated to ascertain the limits of using self-supervised learning in real-life healthcare applications.

IWNE vs IFI
Pretraining on the ImageNet dataset (i.e. IFI), either supervised or self-supervised, is considered an effective strategy [4-6, 8, 51-56]. Several benefits have been attributed to pretraining, including robustness [8, 51-54], generalization [57,58], finding sparser subnetworks from the original [59] and a speed-up in convergence on the downstream task [1,8]. Using IFI for DR has been widely adopted owing to benefits in performance [1-3, 23, 26, 32, 60]. The performance benefits of pretraining have been observed even on diverse datasets which seem distant from the ImageNet dataset [8]. The benefits of pretraining can be attributed to the effective feature-extracting capability of pretrained models in the lower layers [1,8]. However, it is unclear how this translates to a DNN being used for a downstream task after finetuning. While the above-mentioned methods investigate supervised learning, we make a comparative study of IWNE vs IFI along with FS vs CL and their combinations to understand their differentiating features.

Datasets
We focus on diabetic retinopathy (DR) as a use case for our investigations and solely work on publicly available datasets, which are summarized in Table 1.

We make use of the Eyepacs-1 dataset [10], which is available from a former Kaggle challenge. The images are graded on a scale of 0 to 4 (0: no DR, 1: mild DR, 2: moderate DR, 3: severe DR, 4: proliferative DR) according to the International Clinical Diabetic Retinopathy (ICDR) severity scale. DR advances slowly from a healthy eye to a proliferative one, and the progression may take years. However, this transition is discreet and often goes undetected until it has worsened into proliferative DR. Hence, it is essential that this progression is detected and a timely medical diagnosis is performed. In our experiments, we train the models to perform the quinary classification using all five grades. During inference, we reduce the outputs predicted by the model to a binary classification by summing up the output neurons corresponding to the two labels, i.e. healthy classes [0-2] and disease classes [3-4]. Following the summation, we apply a softmax activation to map the outputs to the range [0, 1] and obtain output probabilities. This binary class formulation is consistent with referable DR (rDR) classification in [2,3].
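The reduction from five output neurons to a binary rDR probability described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `referable_dr_probability` and its `disease_classes` parameter are hypothetical, and the default class split simply follows the grouping stated in the text.

```python
import math

def referable_dr_probability(outputs, disease_classes=(3, 4)):
    """Collapse five per-grade outputs into a single disease probability.

    outputs: raw (pre-softmax) model outputs for grades 0..4.
    disease_classes: grades summed into the disease label (assumption: the
    split stated in the text); all remaining grades form the healthy label.
    """
    disease = sum(outputs[c] for c in disease_classes)
    healthy = sum(v for c, v in enumerate(outputs) if c not in disease_classes)
    # Two-way softmax over the summed outputs, shifted for numerical stability.
    shift = max(healthy, disease)
    e_healthy = math.exp(healthy - shift)
    e_disease = math.exp(disease - shift)
    return e_disease / (e_healthy + e_disease)

if __name__ == "__main__":
    print(referable_dr_probability([1.0, 1.0, 1.0, 2.0, 2.0]))
```

The returned value lies in [0, 1] and can be thresholded or fed directly into an AUC computation.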
The Eyepacs-1 dataset [10] consists of 35216 images in the training set and 53576 in the test set. We utilize a non-overlapping subset of around 15% of the training set as the validation set. We train all our different methods on the training set of the Eyepacs-1 dataset and evaluate the performance of the models on two datasets: the test set of Eyepacs-1 and Messidor-2 [46]. The Messidor-2 dataset [46] is a benchmark dataset consisting of 1744 images that are 100% gradable. The evaluation on the Messidor-2 dataset is meant to measure the generalization performance of the algorithms, since the dataset is not used for training and was collected under different conditions, at a different geographical location and with different hardware. Hence, we use all the images of this dataset for testing. We report the AUC for the binary rDR classification task on the respective test sets of each dataset.
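As a reminder of what the reported metric measures, the AUC for a binary task can be computed as the probability that a randomly chosen disease image receives a higher score than a randomly chosen healthy image. The sketch below is a plain-Python illustration of that definition; the function name `roc_auc` is hypothetical and the quadratic loop is chosen for readability, not for use on test sets of the size discussed here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0 (healthy) or 1 (disease).
    scores: predicted disease probabilities, same length as labels.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```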

Models & training procedures
We compare the four training setups which are eventually trained on the DR target dataset.

• Initialization With No External Data (IWNE)
  • FS: supervised training on the DR dataset starting from randomly initialized weights.
  • CL: self-supervised pretraining on the target domain and finetuning also on the same dataset using labeled data.
• Initialization From ImageNet (IFI)
  • FS: supervised pretraining on the ImageNet dataset and finetuning on the DR dataset using labeled data.
  • CL: self-supervised pretraining on the ImageNet dataset and finetuning on the DR dataset using labeled data.
For comparability, we fix the architecture and use a ResNet50 [61] model for all our experiments. In the self-supervised setting, we pretrain the models using the MoCoV2 strategy [62]. For the supervised pretraining, we use the ImageNet-pretrained model provided by torchvision. The IWNE models are trained for 500 epochs with a learning rate of $10^{-4}$. Pretrained models have been shown to converge faster than models trained from scratch [1,8]. Hence, we finetune the IFI models starting from ImageNet-pretrained weights for 50 epochs with a learning rate of $10^{-3}$. The IFI models use the mean and standard deviation of the ImageNet dataset for input normalization, while the IWNE models use the mean and standard deviation computed from the training set of the Eyepacs-1 dataset. The AdamW optimizer [63] with weight decay was used in all the settings. The best model in each training run was chosen based on the maximum AUC score achieved on the validation set, and this model was used for inference on the test set.
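The four setups and the model selection rule can be summarized in a small configuration sketch. The epochs and learning rates are taken from the text above; everything else is an assumption made for illustration: the names `TrainSetup`, `SETUPS` and `best_epoch` are hypothetical, the ImageNet statistics are the commonly used channel values (the text does not list them), and the Eyepacs-1 statistics are left as `None` because the paper does not report them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Stats = Optional[Tuple[float, float, float]]

# Commonly used ImageNet channel statistics (RGB, inputs scaled to [0, 1]).
# Assumption: the text only says "mean and standard deviation of ImageNet".
IMAGENET_MEAN: Stats = (0.485, 0.456, 0.406)
IMAGENET_STD: Stats = (0.229, 0.224, 0.225)

@dataclass(frozen=True)
class TrainSetup:
    name: str
    pretraining: str
    epochs: int
    learning_rate: float
    norm_mean: Stats = None  # None: compute from the Eyepacs-1 training set
    norm_std: Stats = None

SETUPS = {
    "IWNE-FS": TrainSetup("IWNE-FS", "none (random initialization)", 500, 1e-4),
    "IWNE-CL": TrainSetup("IWNE-CL", "MoCoV2 on Eyepacs-1", 500, 1e-4),
    "IFI-FS": TrainSetup("IFI-FS", "supervised ImageNet (torchvision)", 50, 1e-3,
                         IMAGENET_MEAN, IMAGENET_STD),
    "IFI-CL": TrainSetup("IFI-CL", "MoCoV2 on ImageNet", 50, 1e-3,
                         IMAGENET_MEAN, IMAGENET_STD),
}

def best_epoch(val_auc: Sequence[float]) -> int:
    """Index of the epoch with the highest validation AUC (first one on ties)."""
    if not val_auc:
        raise ValueError("no validation scores recorded")
    return max(range(len(val_auc)), key=lambda i: val_auc[i])

if __name__ == "__main__":
    for setup in SETUPS.values():
        print(setup)
    print(best_epoch([0.80, 0.91, 0.89]))
```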

Quantitative performance
We evaluate the performance of the different methods discussed in Section Models & training procedures in terms of AUC. Each model was trained on the full dataset and on various fractions of the training set, down to a fraction of 10% of the labeled samples. Fig 2 shows the final AUC of the binary classification for rDR. We find largely consistent results in terms of the ranking and overall behavior of the different training procedures between the evaluation on the Eyepacs-1 test set and the evaluation on the external Messidor-2 dataset, which is a reassuring sign that our results generalize across datasets. The best-performing method across all training set fractions is IFI-CL, i.e. finetuning a model that was trained in a self-supervised fashion on ImageNet data, closely followed by IFI-FS, which corresponds to the standard training methodology in medical imaging, where a model pretrained on ImageNet is finetuned on the target dataset. The results for the IWNE-CL model, i.e. self-supervised pretraining in the target (DR) domain, are weaker than the former two results. This trend is again followed at lower training set fractions, where the model is trained with reduced fractions of the labeled dataset. A training set fraction of 1.0 corresponds to training with the entire training set of 30,000 images, while a fraction of 0.1 corresponds to 3,000 images. While IWNE models deteriorate in performance, IFI models show only a marginal drop, as shown in Fig 2. The results clearly advocate the use of IFI models as opposed to not using external data, which is in line with most of the medical imaging literature but at first sight contradicts [1], who found no improvements from IFI as compared to direct training on a considerably larger closed-source DR dataset.
The inferior results of IWNE-CL compared to IFI-CL can potentially be attributed to two factors. First, the size of Eyepacs-1 as a pretraining dataset, with around 30k samples, is very small compared to large natural image datasets such as ImageNet with 1.2M images, where self-supervised contrastive methods were demonstrated to work really well. Second, for IWNE-CL we used the same set of transformations proposed for ImageNet in [13], which certainly represents a suboptimal choice for DR images, which differ qualitatively from natural images, and the pretraining algorithm is rather sensitive to this choice.

Statistics of eigenvalues

Condition number. To better understand what makes the IFI models achieve higher performance, we study the activations of the hidden layers. In particular, we compute the eigenvalues of the activations of each layer in the four models we considered. Using the eigenvalues, we plot the condition number [64] as shown in Fig 3a. To prevent the condition number from having very large values due to division by the minimum of the eigenvalues, we redefine the condition number as follows:

We find in Fig 3a that the condition number for IFI models is much lower than that of IWNE models, implying that significantly more diverse features are learned. Also, for both types of initialization, we find that the condition number for self-supervised learning is lower than that of supervised learning in the initial layers. This indicates that self-supervised learning extracts more diverse features than its supervised counterpart. We also find in Fig 3a that for all the different models, the condition number flattens out and becomes indistinguishable for the later layers. The initial layers form the crux of the learning process, extracting effective and diverse feature representations, while the later layers learn to aggregate these features. On the other hand, the final layers are responsible for the discriminative classification, so reducing the diversity here can be beneficial. We also observe this phenomenon in Fig 3a, where the condition number of IFI models in comparison to IWNE models increases in the final layers, indicating a loss in diversity that in turn leads to superior performance as reported in Fig 2.

Spread of eigenvalues. To investigate the distinctive aspects of the initial layers, we plot the eigenvalues of the first layer for all four models in Fig 3b. The eigenvalues are made symmetrical around 0 and plotted in the form of a density for better visualization. The bottom row in Fig 3b also zooms in on the tails.
We find that the IWNE models obtain high and peaked eigenvalues in comparison to IFI models. In addition to lower peak values, the IFI models show heavier tails than the IWNE models. Similar to the findings in the experiments on the condition number, self-supervised learning shows a slightly lower peak value than supervised learning. Additionally, for both types of initialization, the self-supervised models show heavier tails.
The results indicate that IWNE models learn kernels in the first convolutional layer that are activated for some very specific patterns. On the contrary, IFI models learn kernels that activate for a broader range of input features. The superior performance of IFI models can be attributed to this effect while this may also lead to several other benefits including increase in generalization and robustness.
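The eigenvalue analysis in this section can be sketched under explicit assumptions. The text does not state how the eigenvalues of the activations are obtained, so the sketch below assumes they are the eigenvalues of the channel covariance of activations collected over a set of inputs. The redefined condition number of the paper is not reproduced here; `condition_number` uses a simple epsilon-stabilized ratio as a stand-in. All function names are hypothetical.

```python
import numpy as np

def activation_eigenvalues(acts):
    """Eigenvalues (descending) of the channel covariance of activations.

    acts: array of shape (N, C), one row per input sample (assumption:
    spatial dimensions have already been pooled away).
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (acts.shape[0] - 1)
    return np.linalg.eigvalsh(cov)[::-1]

def condition_number(eigs, eps=1e-12):
    """Stand-in ratio lambda_max / (lambda_min + eps); NOT the paper's redefinition."""
    eigs = np.clip(eigs, 0.0, None)  # guard against tiny negative round-off
    return float(eigs.max() / (eigs.min() + eps))

def symmetrized(eigs):
    """Mirror the eigenvalues around 0, as done for the density plots."""
    return np.concatenate([-eigs, eigs])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1000, 8)) * np.arange(1, 9)
    eigs = activation_eigenvalues(acts)
    print(eigs, condition_number(eigs))
```

A histogram of `symmetrized(eigs)` would give the kind of density view described for Fig 3b.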
Distribution fitting. In this section, we fit the eigenvalues of the first convolutional layer to the parameters of several distributions and report the distribution that fits best [65]. Among a wide range of parameterized distributions, we find in Table 2 that all four models fit best to the Pareto distribution, though the parameters vary. A Pareto distribution with the shape parameter α = 1.16 corresponds to the 80-20 rule, implying that 80% of the results come from 20% of the causes [66]. IWNE models show α values higher than 1.16. This indicates that the overall result comes from less than 20% of the activations. In other words, the kernels learned by the IWNE models extract a small yet highly curated set of features from the input. In contrast, we find that IFI brings down the value of α for the Pareto distribution, implying a wider range of feature representations learned by the first convolutional layer. Additionally, for both types of initialization, CL shows a reduced value of α when compared to FS, indicating that the kernels learned by CL methods fire on an even broader range of inputs.
Our studies show that pretraining and self-supervised learning are beneficial for the downstream medical imaging task, enabling the model to learn kernels that fire broadly and in turn extract more diverse and effective features from the input.
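The link between the shape parameter α = 1.16 and the 80-20 rule mentioned above can be checked numerically. For a Pareto distribution with shape α > 1, the largest fraction p of the values accounts for a share p^(1 - 1/α) of the total, and the shape itself can be estimated by maximum likelihood. The sketch below is illustrative only: the function names are hypothetical, the fit assumes strictly positive samples and takes the sample minimum as the scale, and it is not the fitting procedure of [65].

```python
import math
import random

def pareto_top_share(top_fraction, alpha):
    """Share of the total held by the largest `top_fraction` under Pareto(alpha), alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("the share is only defined for alpha > 1")
    return top_fraction ** (1.0 - 1.0 / alpha)

def fit_pareto_alpha(samples):
    """Maximum-likelihood shape estimate, using the sample minimum as the scale x_m."""
    x_m = min(samples)
    if x_m <= 0:
        raise ValueError("samples must be strictly positive")
    log_sum = sum(math.log(x / x_m) for x in samples)
    return len(samples) / log_sum

if __name__ == "__main__":
    alpha_80_20 = math.log(5) / math.log(4)  # approximately 1.16
    print(alpha_80_20, pareto_top_share(0.2, alpha_80_20))
    random.seed(0)
    draws = [random.paretovariate(1.16) for _ in range(50000)]
    print(fit_pareto_alpha(draws))
```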

Robustness to distortions
The heavy-tailed activation statistics in combination with ReLU-thresholding in Section Statistics of Eigenvalues showed that a larger number of neurons are capable of detecting structures in the input when the input data is varied according to sampling from the dataset. One can expect that this also may translate to an increased detection capability when input samples are varied by data augmentation parameters towards zones of lower data density. We have performed this experiment for the IWNE and IFI models by distorting the input with a set of predefined distortions as shown in [67].
One can see from Fig 4 that for the majority of distortion cases, the score for the self-supervised model is higher, indicating a higher robustness to the respective distortions. There is a marked difference between IWNE and IFI models. In the former case, CL always provides an increase in robustness in comparison to FS. Using IFI in the latter case is known to provide good generalization for finetuning with respect to a wide range of target datasets. This improved generalization levels the difference between FS and CL. However, IFI-CL still improves robustness for different noise types, pixelation and lower levels of saturation changes. Note the conspicuous outlier in IFI for JPEG compression.

Table 2. Distribution fitting for the eigenvalues of the activations of the first layer. For all four models, the eigenvalues are best parametrized by a Pareto distribution. We also find that the self-supervised models show a smaller value for the shape parameter of the Pareto distribution.

Fig 4. Bottom row shows the difference for IFI models. In the case of IWNE, the difference is consistently positive, implying that the self-supervised model has a higher prediction score than the plainly supervised model and thus exhibits a higher robustness to distortions. See Section Robustness to distortions for a detailed discussion.
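The comparison described in this section, i.e. the difference between the prediction score of the self-supervised and the supervised model on distorted inputs, can be sketched as follows. This is an illustration under stated assumptions and not the distortion suite of [67]: the two distortions, their parameters and all function names are hypothetical, and `score_cl` and `score_fs` stand for any callables that map an image to a prediction score.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Additive Gaussian noise on an image with values in [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def pixelate(image, block):
    """Average over block x block tiles, then upsample by repetition.

    image: array of shape (H, W, C) with H and W divisible by block.
    """
    h, w, c = image.shape
    tiles = image.reshape(h // block, block, w // block, block, c).mean(axis=(1, 3))
    return tiles.repeat(block, axis=0).repeat(block, axis=1)

def score_difference(score_cl, score_fs, image, distortions):
    """CL minus FS prediction score for each named distortion of one image."""
    diffs = {}
    for name, distort in distortions.items():
        distorted = distort(image)  # distort once so both models see the same input
        diffs[name] = score_cl(distorted) - score_fs(distorted)
    return diffs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.uniform(size=(8, 8, 3))
    distortions = {
        "gaussian_noise": lambda x: add_gaussian_noise(x, 0.1, rng),
        "pixelate": lambda x: pixelate(x, 2),
    }
    # Placeholder scorers standing in for trained models.
    print(score_difference(lambda x: float(x.mean()), lambda x: float(x.min()), image, distortions))
```

A positive difference for a given distortion corresponds to the case discussed above in which the self-supervised model retains a higher prediction score.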

Quantitative analysis of learned cues
Explainability for DNNs reveals what the model looks at in the image to make its prediction [68-79]. Using ground-truth segmentation masks, explanations have been evaluated to show quantitatively whether what the model is looking at is relevant for making the decision [80]. In the case of DR, a reasonable expectation is that the trained model looks at lesions in the retina that are indicative of the disease in order to make its decision. In order to evaluate the explanation heatmaps, we use the IDRiD dataset [11], which contains detailed pixel-wise annotations of the different lesions that contribute to the disease. The dataset consists of 80 images with segmentation masks for microaneurysms, haemorrhages and hard exudates. The IDRiD dataset also contains segmentation maps for soft exudates for a smaller subset of images, which we excluded from our quantitative evaluation.
To obtain explanation heatmaps, we utilize Layer-wise Relevance Propagation (LRP) [70,74]. LRP is a principled approach to decompose the decisions of the classifier and assign pixel-wise relevances determining the contributions of the input pixels towards the decision. The layer-wise conservation principle in LRP assures that the relevance from a higher layer is preserved when propagated to a lower layer. The forward pass for the activations of any given layer in a DNN can be described through the terms $z_{ij} = a_i^{l} w_{ij}^{(l,l+1)}$, the weighted activation of neuron $i$ onto neuron $j$ in the next layer, where $a_i^{l}$ is the activation of neuron $i$ in the previous layer, so that $z_{ij}$ is the contribution of neuron $i$ at layer $l$ to the activation of neuron $j$ at layer $l + 1$. The relevances are computed using the $\alpha_1\beta_0$ rule:

$$R_i = \sum_j \frac{z_{ij}^{+}}{\sum_{i'} z_{i'j}^{+}} R_j$$

The intuition behind LRP is that neurons of the lower layer that contribute most to the activation of a higher-layer neuron $j$ receive a larger share of its relevance $R_j$.
Decomposing the contributions into their positive part $z_{ij}^{+}$ and negative part $z_{ij}^{-}$ allows for exact conservation of the relevances [69]. The bottom row of Fig 5 shows the explanation heatmaps obtained with the different training methods. By comparing each result to the total marked in red in Fig 5, we can evaluate the effectiveness of the model in looking at the lesions to make the prediction. We find that explanation heatmaps from IWNE overfit on the hard exudates and show minimal correlation with the other lesions. On the other hand, explanation heatmaps from IFI models are significantly more spread out and correlate better with the different lesions.
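The relevance redistribution performed by the α1β0 rule can be illustrated for a single fully connected layer. This is a minimal NumPy sketch of the standard rule from the LRP literature, not the implementation used for the experiments; the function name and argument layout are hypothetical, and biases are ignored for simplicity.

```python
import numpy as np

def lrp_alpha1_beta0(activations, weights, relevance_out):
    """Propagate relevance from layer l+1 back to layer l (alpha=1, beta=0).

    activations:   shape (I,), activations a_i at layer l
    weights:       shape (I, J), weights w_ij from layer l to layer l+1
    relevance_out: shape (J,), relevances R_j at layer l+1
    Returns an array of shape (I,) with the relevances R_i at layer l.
    """
    z = activations[:, None] * weights           # z_ij = a_i * w_ij
    z_pos = np.clip(z, 0.0, None)                # positive contributions only
    denom = z_pos.sum(axis=0)                    # sum over i' of z_i'j^+
    safe = np.where(denom > 0, denom, 1.0)       # avoid division by zero
    share = np.where(denom > 0, z_pos / safe, 0.0)
    return share @ relevance_out

if __name__ == "__main__":
    a = np.array([1.0, 2.0])
    w = np.array([[1.0, -1.0], [0.5, 1.0]])
    r_out = np.array([1.0, 2.0])
    r_in = lrp_alpha1_beta0(a, w, r_out)
    print(r_in, r_in.sum(), r_out.sum())
```

When every higher-layer neuron receives at least one positive contribution, the relevances on both sides of the layer sum to the same value, which is the conservation property referred to above.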
The correlation of explanation heatmaps to the ground-truth segmentation maps also helps us make a quantitative evaluation of how accurately the model relies on the disease to make its prediction. We follow the evaluation strategies adopted in [80], including relevance mass accuracy (RMA) and relevance rank accuracy (RRA). For a given input RGB image $x$, the relevances $R_i$ determining the importance of the input features $x_i$ also have the dimensions of the image. However, the ground-truth segmentation masks $S \in [0, 1]$ have only two dimensions. Hence, we pool the relevances across the channels to be able to compare them with the segmentation masks. We utilize the two pooling strategies followed by [80], one of which is:

• sum,pos: $R^{pool} = \max\left(0, \sum_{c=1}^{C} R_c\right)$

where $C$ is the number of channels. However, the findings here are agnostic to the pooling strategy utilized. Given pooled relevances and ground-truth segmentation masks, the relevance mass accuracy is defined as:

$$\mathrm{RMA} = \frac{\sum_{k:\, S_k = 1} R^{pool}_{k}}{\sum_{k=1}^{N} R^{pool}_{k}}$$

where the numerator corresponds to the sum of relevances where the ground-truth segmentation map exists and the denominator is the sum of all relevances over the $N$ pixels. The relevance rank accuracy is defined as:

$$P_{\mathrm{top}\,K} = \left\{p_1, \dots, p_K \mid R^{pool}_{p_1} > R^{pool}_{p_2} > \dots > R^{pool}_{p_K}\right\}, \qquad \mathrm{RRA} = \frac{|P_{\mathrm{top}\,K} \cap GT|}{|GT|}$$

where $GT$ is the set of pixels covered by the ground-truth mask, $K = |GT|$, and $R^{pool}_{p_i}$ is the $i$-th highest pooled relevance. While RMA corresponds to the precision, RRA corresponds to the recall.

Table 3 shows the results for RMA and RRA for the explanation heatmaps correlated with the ground-truth segmentation maps from the IDRiD challenge. We report the accuracies for each lesion (microaneurysms, haemorrhages and hard exudates) and for a total, where we combine the above-mentioned lesions. The heatmaps for each of the methods are computed by backpropagating from the output neuron corresponding to severe DR, which can also be considered as the ground-truth DR level for the given input. The heatmaps are evaluated using the two pooling strategies mentioned above for each lesion. As a control, we also report the results obtained by replacing the explanation heatmaps with random variables from a Gaussian distribution. Any method that shows results similar to the control indicates that the heatmaps are just random, i.e. the model looks at a random set of input features to make its prediction. In each category (lesion), the best result among the different training strategies is marked in bold for each pooling method.

Fig 5. Top right image is the total that we compute by combining the segmentation maps of different lesions. Bottom row shows the explanation heatmaps for the given input. Each explanation heatmap is correlated with the total image marked in red to evaluate the effectiveness of the model towards making the prediction for the disease. We find that IWNE-FS overfits on the hard exudates and also fails to pick up on cues related to microaneurysms. We also find that explanation heatmaps of IFI models show reduced signs of overfitting to a single lesion when compared to IWNE. https://doi.org/10.1371/journal.pone.0274291.g005
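The two evaluation measures can be sketched in a few lines of NumPy. The sketch follows the definitions of relevance mass accuracy and relevance rank accuracy as given in the literature cited as [80] and the sum,pos pooling described in the text; the function names are hypothetical, and the binary mask is assumed to mark lesion pixels with 1.

```python
import numpy as np

def pool_sum_pos(relevance):
    """sum,pos pooling: sum over channels, clipped at zero. Input shape (C, H, W)."""
    return np.maximum(0.0, relevance.sum(axis=0))

def relevance_mass_accuracy(pooled, mask):
    """Fraction of the total pooled relevance that falls inside the mask."""
    total = pooled.sum()
    if total <= 0:
        return 0.0
    return float(pooled[mask.astype(bool)].sum() / total)

def relevance_rank_accuracy(pooled, mask):
    """Fraction of the K most relevant pixels inside the mask, K = mask size."""
    flat_mask = mask.astype(bool).ravel()
    k = int(flat_mask.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(pooled.ravel())[::-1][:k]  # indices of the K largest values
    return float(flat_mask[top_k].sum() / k)

if __name__ == "__main__":
    relevance = np.zeros((3, 2, 2))
    relevance[0] = [[3.0, -1.0], [1.0, 0.0]]
    mask = np.array([[1, 0], [0, 1]])
    pooled = pool_sum_pos(relevance)
    print(relevance_mass_accuracy(pooled, mask), relevance_rank_accuracy(pooled, mask))
```

Replacing `pooled` with Gaussian noise of the same shape gives the kind of random control described above.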
We find in Table 3 that in the case of microaneurysms, random explanations achieve a mean accuracy of 0.0073 for RMA. Here, the model IWNE-FS achieves results that are very close to the results for the random explanations. On the other hand, all the other models report accuracies that are higher than the corresponding control value. This indicates that IWNE-FS may be ignoring microaneurysms for making its decision. The RMA results in Table 3 show that for the IWNE models, CL achieves better results. IFI models in general report higher accuracies than IWNE models. Similar to IWNE, we find for IFI models that CL reports better RMA than FS using both pooling strategies. This is confirmed again by the results for RRA in the same table, where models with CL achieve the best results. Microaneurysms are the smallest lesions, and it is vital for a method to base its decision on them for detecting progressive cases of DR. Our results indicate that IFI models, and CL in particular, are better equipped to include microaneurysms when making their predictions.

Table 3. Relevance mass accuracy (RMA) and relevance rank accuracy (RRA) on the LRP-$\alpha_1\beta_0$ explanation heatmaps of images of the IDRiD dataset. The results show that while supervised models overfit on the hard exudates, the self-supervised models look at a diverse set of input features (lesions). On the other hand, we also find that IFI models show higher accuracies when compared to IWNE models.
Haemorrhages are lesions that are slightly larger than microaneurysms. We find in Table 3 that here again IWNE-FS reports similar accuracies to that of the control indicating that this model may be ignoring the haemorrhages as well. Among IWNE models, CL clearly achieves higher RMA as well as higher RRA. This is again the case on the IFI models where CL achieves higher RMA and RRA indicating that the explanations using this model are better correlated with the ground-truth than their supervised counterpart FS.
In contrast to the smaller lesions, the hard exudates are large yellowish-white deposits with sharp gradients. Here, for RMA, the supervised models achieve better results than the self-supervised models, as shown in Table 3. The results on RRA for hard exudates show that in the majority of cases, for both IWNE and IFI models, the supervised models show higher accuracies than the self-supervised models.
For the total, which measures the sum of all the different lesions, we find here again that the supervised models achieve better results with RMA, as shown in Table 3. With RRA, neither of the IWNE models clearly outperforms the other in the case of the total. However, for IFI, the self-supervised model clearly outperforms the supervised model for the total of all the lesions.
The results of RMA and RRA in Table 3 reveal that the supervised models overfit on the hard exudates for both types of initialization. IWNE-FS in particular fails to base its decision on microaneurysms and haemorrhages, which may be highly relevant for the prediction of the onset of the disease. The results on the total are skewed by the results on the hard exudates. In alignment with our observations in Section Statistics of eigenvalues, we find that the IFI models look at a diverse set of input features (lesions) and report consistently higher accuracies than their IWNE counterparts. Among the IFI models, the explanation heatmaps of CL correlate better with the ground truth for a variety of lesions, indicating that CL looks at a more diverse set of input features than any other method.

Summary and conclusions
Deep learning-based methods for the diagnosis of diabetic retinopathy have shown remarkable performance. In our paper, we study the important question of the robustness of different training strategies, namely initialization from ImageNet pretraining and self-supervised learning. Our findings are three-fold. Firstly, we show the performance gains obtained by self-supervised learning in diabetic retinopathy. Secondly, we demonstrate the advantage of self-supervised learning along with initialization from ImageNet pretraining for diabetic retinopathy by analyzing the statistics of the eigenvalues of the feature representations learned. We also show improvements in robustness to distortions for self-supervised learning in comparison to purely supervised training. Finally, we use interpretability methods to gain quantitative insights into the patterns exploited by models trained using the different training schemes. In particular, we find that initialization from ImageNet pretraining significantly reduces overfitting to large lesions and improves the ability to take into account minute lesions, which are indicative of the progression of the disease.
With our study, we try to convey that a more holistic view on the benefits of pretraining and self-supervision in medical imaging, along the lines of the present study, is important. To summarize, in the absence of large unlabeled domain-specific data that would allow for self-supervised pretraining, we see numerous benefits in favor of using models pretrained with self-supervision on ImageNet as a starting point for finetuning on domain-specific data, which we put forward as a general recommendation.